HALLS: An Energy-Efficient Highly Adaptable Last Level STT-RAM Cache for Multicore Systems
Spin-Transfer Torque RAM (STT-RAM) is widely considered a promising
alternative to SRAM in the memory hierarchy due to STT-RAM's non-volatility,
low leakage power, high density, and fast read speed. STT-RAM's small
feature size is particularly desirable for the last-level cache (LLC), which
typically occupies a large portion of the silicon die. However, long write
latency and high write energy remain key challenges for implementing STT-RAMs
in the CPU cache. An increasingly popular method for addressing these challenges involves
trading off non-volatility for reduced write latency and write energy by
relaxing the STT-RAM's data retention time. However, in order to maximize
energy saving potential, the cache configurations, including STT-RAM's
retention time, must be dynamically adapted to executing applications' variable
memory needs. In this paper, we propose a highly adaptable last level STT-RAM
cache (HALLS) that allows the LLC configurations and retention time to be
adapted to applications' runtime execution requirements. We also propose
low-overhead runtime tuning algorithms to dynamically determine the best
(lowest energy) cache configurations and retention times for executing
applications. Compared to prior work, HALLS reduced the average energy
consumption by 60.57% in a quad-core system, while introducing marginal latency
overhead.
Comment: To appear in IEEE Transactions on Computers (TC).
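The runtime tuning the abstract describes can be illustrated with a minimal sketch. This is a hypothetical greedy search, not the paper's actual algorithm: the candidate retention times, associativities, and the toy energy model below are all illustrative assumptions.

```python
# Hypothetical sketch of a low-overhead tuning loop in the spirit of HALLS:
# sample candidate (retention time, associativity) configurations against a
# profiled access pattern and keep the lowest-energy one.
# The energy model is a stand-in, NOT the paper's model.

RETENTION_TIMES_US = [10, 100, 1000]   # candidate STT-RAM retention times (us)
ASSOCIATIVITIES = [4, 8, 16]           # candidate LLC associativities

def estimate_energy(retention_us, ways, writes, reads, refreshes):
    """Toy model: shorter retention -> cheaper writes but more refreshes."""
    write_energy = 1.0 / (retention_us ** 0.25)  # relaxed retention lowers write cost
    refresh_energy = refreshes / retention_us    # but forces more frequent refreshes
    leakage = 0.01 * ways                        # more ways -> more leakage
    return writes * write_energy + reads * 0.1 + refresh_energy + leakage

def tune(profile):
    """Return the (retention, ways) pair with the lowest estimated energy."""
    best = None
    for rt in RETENTION_TIMES_US:
        for ways in ASSOCIATIVITIES:
            e = estimate_energy(rt, ways, profile["writes"],
                                profile["reads"], profile["refreshes"])
            if best is None or e < best[0]:
                best = (e, rt, ways)
    return best[1], best[2]

profile = {"writes": 1_000, "reads": 10_000, "refreshes": 500}
rt, ways = tune(profile)
```

For this read-heavy profile the search settles on the longest retention time and smallest associativity; a write-heavy profile would shift the balance toward shorter retention.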
A Survey of Phase Classification Techniques for Characterizing Variable Application Behavior
Adaptable computing is an increasingly important paradigm that specializes
system resources to variable application requirements, environmental
conditions, or user requirements. Adapting computing resources to variable
application requirements (or application phases) is otherwise known as
phase-based optimization. Phase-based optimization takes advantage of
application phases, i.e., execution intervals during which an application
behaves similarly, to enable effective and beneficial adaptability. In order for
phase-based optimization to be effective, the phases must first be classified
to determine when application phases begin and end, and ensure that system
resources are accurately specialized. In this paper, we present a survey of
phase classification techniques that have been proposed to exploit the
advantages of adaptable computing through phase-based optimization. We focus on
recent techniques and classify these techniques with respect to several factors
in order to highlight their similarities and differences. We divide the
techniques by their major defining characteristics---online/offline and
serial/parallel. In addition, we discuss other characteristics such as
prediction and detection techniques, the characteristics used for prediction,
interval type, etc. We also identify gaps in the state-of-the-art and discuss
future research directions to enable and fully exploit the benefits of
adaptable computing.
Comment: To appear in IEEE Transactions on Parallel and Distributed Systems
(TPDS).
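A common family of techniques this survey covers summarizes each fixed-length execution interval as a signature vector (for example, basic-block frequencies) and declares a new phase when consecutive signatures diverge. The sketch below is an assumed minimal instance of that idea, not any specific surveyed technique; the vectors and threshold are made up.

```python
# Minimal sketch of online phase classification via signature-vector
# similarity: each interval yields a frequency vector, and a new phase is
# declared when the distance between consecutive vectors exceeds a threshold.

def manhattan(a, b):
    """Manhattan (L1) distance between two interval signature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def classify_phases(interval_vectors, threshold):
    """Assign a phase ID to each interval; a large change starts a new phase."""
    phases, current, prev = [], 0, None
    for vec in interval_vectors:
        if prev is not None and manhattan(vec, prev) > threshold:
            current += 1          # behavior changed enough: new phase begins
        phases.append(current)
        prev = vec
    return phases

vectors = [[5, 1, 0], [5, 2, 0], [0, 0, 9], [0, 1, 9]]
print(classify_phases(vectors, threshold=3))   # [0, 0, 1, 1]
```

The first two intervals stay in phase 0 because their signatures differ by less than the threshold; the jump to the third interval triggers a new phase.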
MirrorCache: An Energy-Efficient Relaxed Retention L1 STTRAM Cache
Spin-Transfer Torque RAM (STTRAM) is a promising alternative to SRAM in on-chip caches due to several advantages, including non-volatility, low leakage, high integration density, and CMOS compatibility. However, STTRAMs' wide adoption in resource-constrained systems is impeded, in part, by high write energy and latency. A popular approach to mitigating these overheads involves relaxing the STTRAM's retention time in order to reduce the write latency and energy. However, this approach usually requires a dynamic refresh scheme to maintain cache blocks' data integrity beyond the retention time, and typically requires an external refresh buffer. In this paper, we propose MirrorCache, an energy-efficient, buffer-free refresh scheme. MirrorCache leverages the STTRAM cell's compact feature size and uses an auxiliary segment of the same size as the logical cache to handle refresh operations without the overheads of an external refresh buffer. Our experiments show that, compared to prior work, MirrorCache can reduce the average cache energy by at least 39.7% for a variety of systems.
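The buffer-free refresh idea can be illustrated with a simplified functional model. This is an assumption about how a mirrored auxiliary segment could behave, not the paper's exact design: each logical block gets a sibling slot, and a refresh copies the live data into that sibling, whose write resets the retention clock, so no external refresh buffer is needed.

```python
# Simplified functional model (illustrative, not the paper's design) of a
# buffer-free mirrored refresh: each logical block has two physical slots;
# refreshing copies data to the sibling slot in place of an external buffer.

class MirrorCacheModel:
    def __init__(self, num_blocks, retention):
        self.retention = retention
        self.slots = [[None, None] for _ in range(num_blocks)]  # [primary, mirror]
        self.active = [0] * num_blocks      # which slot currently holds live data
        self.written_at = [0] * num_blocks  # time of the last write per block

    def write(self, idx, data, now):
        self.slots[idx][self.active[idx]] = data
        self.written_at[idx] = now

    def read(self, idx, now):
        # If the block is about to expire, refresh by copying to the sibling
        # slot; the copy is itself a fresh write, resetting the retention clock.
        if now - self.written_at[idx] >= self.retention:
            src = self.active[idx]
            self.slots[idx][1 - src] = self.slots[idx][src]
            self.active[idx] = 1 - src
            self.written_at[idx] = now
        return self.slots[idx][self.active[idx]]

cache = MirrorCacheModel(num_blocks=4, retention=10)
cache.write(0, "blk-A", now=0)
assert cache.read(0, now=5) == "blk-A"    # still within retention
assert cache.read(0, now=12) == "blk-A"   # expired: refreshed via mirror slot
assert cache.active[0] == 1               # data now lives in the mirror slot
```

Because the refresh is a copy between two in-array slots, the model never allocates storage outside the cache array itself, mirroring the abstract's claim of avoiding an external refresh buffer.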
Design Space Exploration of Sparsity-Aware Application-Specific Spiking Neural Network Accelerators
Spiking Neural Networks (SNNs) offer a promising alternative to Artificial
Neural Networks (ANNs) for deep learning applications, particularly in
resource-constrained systems. This is largely due to their inherent sparsity,
influenced by factors such as the input dataset, the length of the spike train,
and the network topology. While a few prior works have demonstrated the
advantages of incorporating sparsity into the hardware design, especially in
terms of reducing energy consumption, the impact on hardware resources has not
yet been explored. This is where design space exploration (DSE) becomes
crucial, as it allows for the optimization of hardware performance by tailoring
both the hardware and model parameters to suit specific application needs.
However, DSE can be extremely challenging given the potentially large design
space and the interplay of hardware architecture design choices and
application-specific model parameters.
In this paper, we propose a flexible hardware design that leverages the
sparsity of SNNs to identify highly efficient, application-specific accelerator
designs. We develop a high-level, cycle-accurate simulation framework for this
hardware and demonstrate the framework's benefits in enabling detailed and
fine-grained exploration of SNN design choices, such as the layer-wise
logical-to-hardware ratio (LHR). Our experimental results show that our design
can (i) achieve substantial reductions in hardware resources and (ii) deliver
considerable speedups, while requiring fewer hardware resources compared to
sparsity-oblivious designs. We further showcase the robustness of our framework
by varying spike train lengths with different neuron population sizes to find
the optimal trade-off points between accuracy and hardware latency.
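The kind of design-space exploration described above can be sketched in a few lines. Everything below is an illustrative assumption: the layer sizes, the LHR choices, and the toy cost model stand in for the paper's cycle-accurate simulator, with exhaustive enumeration plus Pareto filtering standing in for its DSE flow.

```python
# Hedged sketch of LHR design-space exploration: enumerate per-layer
# logical-to-hardware ratios, score each design with a toy cost model, and
# keep the Pareto-optimal (resources, latency) trade-off points.

from itertools import product

LAYER_NEURONS = [128, 64, 10]   # hypothetical SNN topology
LHR_CHOICES = [1, 2, 4, 8]      # logical neurons time-multiplexed per PE

def cost(lhr_per_layer, sparsity=0.8):
    """Toy model: higher LHR -> fewer PEs (less area) but more cycles."""
    resources = sum(n // r for n, r in zip(LAYER_NEURONS, lhr_per_layer))
    # Sparsity skips a fraction of spike events, shrinking effective latency.
    latency = sum(r * (1 - sparsity) * n
                  for n, r in zip(LAYER_NEURONS, lhr_per_layer))
    return resources, latency

def pareto_front(points):
    """Keep designs not dominated in both resources and latency."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in points)]

points = [cost(c) for c in product(LHR_CHOICES, repeat=len(LAYER_NEURONS))]
front = pareto_front(points)
```

The front always contains the two extremes, maximum time-multiplexing (smallest area, highest latency) and fully parallel (largest area, lowest latency), plus intermediate trade-off points an architect could pick from for a given application.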